33 research outputs found

    Image editing and interaction tools for visual expression

    Get PDF
    Digital photography is becoming extremely common in our daily life; however, images remain difficult to edit and interact with. From a user's perspective, it is important to be able to interact freely with the images on a smartphone or iPad. In this thesis we develop several image editing and interaction systems with this idea in mind. We aim to create visual models with pre-computed internal structures such that interaction is readily supported. We demonstrate that such interactive models, driven by a user's hand, can deliver powerful visual expressiveness and make static pixel arrays much more fun to play with. The first system harnesses the editing power of vector graphics. We convert raster images into a vector representation using Loop's subdivision surfaces. An image is represented by a multi-resolution, feature-preserving sparse control mesh, with which image editing can be done at a semantic level: a user can easily put a smile on a face image, or adjust the level of scene abstractness through a simple slider. The second system allows one to insert an object from an image into a new scene. The key is to correct the shading on the object so that it is consistent with the scene. Unlike traditional approaches, we use a simple shape to capture gross shading effects and a set of shading detail images to account for visual complexities. The high-frequency nature of these detail images allows a moderate range of interactive composition effects without causing alarming visual artifacts. The third system works on video clips instead of a single image. We propose a fully automated algorithm to create
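
    A minimal sketch of the gross-shading / shading-detail split mentioned in the second system. The decomposition used here (a low-pass shading estimate plus a high-frequency ratio image) and the function names are assumptions for illustration, not the thesis's exact method.

        # Sketch: split an image into smooth "gross" shading and high-frequency detail,
        # then recompose the detail on top of shading recomputed for a new scene.
        import numpy as np
        from scipy.ndimage import gaussian_filter

        def decompose_shading(luminance, sigma=15.0, eps=1e-4):
            """Split a luminance image into a smooth gross-shading layer and a
            high-frequency detail (ratio) image."""
            gross = gaussian_filter(luminance, sigma)        # coarse shading estimate
            detail = luminance / np.maximum(gross, eps)      # high-frequency residual
            return gross, detail

        def recompose(new_gross, detail):
            """Apply the detail layer on top of shading for the new scene."""
            return new_gross * detail

        # Demo on a synthetic luminance image (placeholder for a real photograph).
        lum = gaussian_filter(np.random.default_rng(0).random((64, 64)), 3.0) + 0.1
        gross, detail = decompose_shading(lum)
        relit = recompose(gross * 0.5, detail)    # darker "new scene" shading, same surface detail

    Because the detail layer carries only high frequencies, swapping in shading estimated for the target scene changes the overall illumination while preserving surface texture, which is loosely the effect the abstract describes.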

    Learning Structured Inference Neural Networks with Label Relations

    Full text link
    Images of scenes contain various objects as well as abundant attributes, and can be categorized at diverse levels of visual abstraction. A natural image can be assigned fine-grained labels that describe its major components, coarse-grained labels that depict high-level abstraction, or a set of labels that reveal attributes. Such categorization at different concept layers can be modeled with label graphs encoding label information. In this paper, we exploit this rich information within a state-of-the-art deep learning framework, and propose a generic structured model that leverages diverse label relations to improve image classification performance. Our approach employs a novel stacked label prediction neural network, capturing both inter-level and intra-level label semantics. We evaluate our method on benchmark image datasets, and empirical results illustrate the efficacy of our model. Comment: Conference on Computer Vision and Pattern Recognition (CVPR) 201
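
    A toy sketch of stacked label prediction across concept layers (coarse to fine). The single top-down pass, the level sizes, and the random weight matrices are illustrative assumptions; the paper's structured inference also models intra-level and bottom-up label relations, which are omitted here.

        # Each level predicts labels from the image feature and a message from the
        # coarser level above it; weights are random placeholders for illustration.
        import numpy as np

        rng = np.random.default_rng(0)
        levels = [4, 12, 30]                     # e.g. scene / coarse / fine label counts (assumed)
        feat_dim = 64

        W_vis = [rng.normal(size=(n, feat_dim)) * 0.1 for n in levels]      # per-level visual classifiers
        W_rel = [rng.normal(size=(levels[i + 1], levels[i])) * 0.1          # inter-level label relations
                 for i in range(len(levels) - 1)]

        def predict(feature):
            """Return label activations per level, each refined by the level above."""
            activations = []
            for i, W in enumerate(W_vis):
                score = W @ feature                                  # evidence from the image feature
                if i > 0:
                    score = score + W_rel[i - 1] @ activations[-1]   # message from coarser labels
                activations.append(np.tanh(score))
            return activations

        scores = predict(rng.normal(size=feat_dim))
        print(int(np.argmax(scores[-1])))        # index of the strongest fine-grained label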

    Learning hierarchical poselets for human parsing

    Full text link
    We consider the problem of human parsing with part-based models. Most previous work on part-based models considers only rigid parts (e.g. torso, head, half limbs) guided by human anatomy. We argue that this representation of parts is not necessarily appropriate for human parsing. In this paper, we introduce hierarchical poselets, a new representation for human parsing. Hierarchical poselets can be rigid parts, but they can also be parts that cover large portions of the human body (e.g. torso + left arm); in the extreme case, they can be whole bodies. We develop a structured model to organize poselets in a hierarchical way and learn the model parameters in a max-margin framework. We demonstrate the superior performance of our proposed approach on two datasets with aggressive pose variations.
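
    A toy scoring function for a small tree of parts at several scales, in the spirit of the hierarchical model described above. The part tree, features, and weights below are made up for illustration, and the max-margin learning of the weights is not shown.

        # Score = appearance evidence per poselet + deformation costs between parent and child.
        import numpy as np

        rng = np.random.default_rng(1)
        tree = {"body": ["torso+left arm", "legs"], "torso+left arm": [], "legs": []}   # assumed hierarchy
        feat_dim = 16
        w_app = {p: rng.normal(size=feat_dim) * 0.1 for p in tree}    # appearance weights per part
        w_def = {p: 0.5 for p in tree}                                # deformation penalty weights

        def score(root, features, offsets):
            """Sum appearance scores and parent-child deformation costs over the tree."""
            s = w_app[root] @ features[root]
            for child in tree[root]:
                s += score(child, features, offsets)
                s -= w_def[child] * np.sum(offsets[child] ** 2)       # penalize unusual displacement
            return s

        features = {p: rng.normal(size=feat_dim) for p in tree}       # placeholder image features
        offsets = {p: rng.normal(size=2) for p in tree}               # placeholder part displacements
        print(score("body", features, offsets))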

    Improved fuzzy neural network control for the clamping force of Camellia fruit picking manipulator

    Get PDF
    During operation of the vibrating mechanism, the push-shaking camellia fruit picking manipulator needs to keep the output force of the clamping hydraulic motor constant, so that the camellia fruit tree trunk is neither loosened nor damaged during picking, which could affect the tree's later growth. To this end, this paper derived the state-space model of the manipulator's valve-controlled clamping hydraulic motor system, designed a fuzzy wavelet neural network (FWNN) controller based on the traditional incremental PID control principle, and optimized the network parameters with an improved grey wolf optimizer (GWO). The control system was then simulated in MATLAB/Simulink, with and without external disturbances, and compared with a traditional PID controller and a fuzzy PID (FPID) controller. The results show that the traditional PID and FPID controllers have slow response and poor robustness, while the improved fuzzy wavelet neural network PID (IFWNN PID) controller responds quickly and is strongly robust, and thus meets the requirement of a constant clamping force from the hydraulic motor. Finally, a field clamping test was carried out on the picking manipulator. Compared with the PID controller, the manipulator controlled by the IFWNN PID controller shortens the clamping time by 20.0% and reduces clamping damage by 13.6%, verifying that the designed controller can meet the clamping operation requirements of the camellia fruit picking machine.
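
    A minimal sketch of the incremental PID law that the IFWNN controller builds on. The constant gains and the first-order "plant" are placeholders; in the paper a fuzzy wavelet neural network (tuned by an improved GWO) supplies the gain adjustments online, which is not reproduced here.

        # Incremental PID: du_k = Kp*(e_k - e_km1) + Ki*e_k + Kd*(e_k - 2*e_km1 + e_km2)
        Kp, Ki, Kd = 2.0, 0.5, 0.05          # fixed gains for illustration only
        setpoint = 1.0                       # desired (normalized) clamping force
        force, u = 0.0, 0.0
        e_km1 = e_km2 = 0.0

        for k in range(200):
            e_k = setpoint - force
            du = Kp * (e_k - e_km1) + Ki * e_k + Kd * (e_k - 2 * e_km1 + e_km2)
            u += du                          # incremental PID integrates the control increment
            force += 0.1 * (u - force)       # toy first-order response of the hydraulic motor
            e_km2, e_km1 = e_km1, e_k

        print(round(force, 3))               # settles near the setpoint

    In the paper's scheme the three gains are not constants; the FWNN adjusts them each step from the error signal, which is what yields the faster, more robust response reported above.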

    Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

    Full text link
    The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate the potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs in answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset, consisting of long expert-labelled golden low-level text descriptions for 499 images, together with a GPT-involved comparison pipeline between the outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements to MLLMs in these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://vqassessment.github.io/Q-Bench. Comment: 25 pages, 14 figures, 9 tables, preprint version
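
    A sketch of a softmax-based quality score of the kind the abstract describes: compare the model's next-token logits for a positive and a negative quality word and map the softmax probability of the positive one to a scalar score. The two-token choice and the prompt are assumptions for illustration, not necessarily Q-Bench's exact setup.

        import numpy as np

        def softmax_quality_score(logit_good, logit_poor):
            """Return a quality score in [0, 1] from two next-token logits."""
            z = np.array([logit_good, logit_poor])
            p = np.exp(z - z.max())
            p /= p.sum()
            return float(p[0])               # probability mass assigned to "good"

        # Example: logits an MLLM might emit for the next token after a prompt such as
        # "Rate the quality of this image:" (the values here are made up).
        print(softmax_quality_score(logit_good=2.1, logit_poor=0.3))   # ~0.86

    Because the score is a probability rather than a discrete word, it can be correlated directly against mean opinion scores on existing IQA datasets.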

    Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models

    Full text link
    Multi-modality foundation models, as represented by GPT-4V, have brought a new paradigm for low-level visual perception and understanding tasks, responding to a broad range of natural human instructions within a single model. While existing foundation models have shown exciting potential on low-level visual tasks, their related abilities are still preliminary and need to be improved. In order to enhance these models, we conduct a large-scale subjective experiment collecting a vast number of real human feedbacks on low-level vision. Each feedback follows a pathway that starts with a detailed description of the low-level visual appearance (*e.g. clarity, color, brightness* of an image) and ends with an overall conclusion, with an average length of 45 words. The constructed **Q-Pathway** dataset includes 58K detailed human feedbacks on 18,973 images with diverse low-level appearance. Moreover, to enable foundation models to robustly respond to diverse types of questions, we design a GPT-participated conversion to process these feedbacks into 200K instruction-response pairs in diverse formats. Experimental results indicate that **Q-Instruct** consistently elevates low-level perception and understanding abilities across several foundation models. We anticipate that our datasets can pave the way for a future in which general intelligence can perceive and understand low-level visual appearance and evaluate visual quality like a human. Our dataset, model zoo, and demo are published at: https://q-future.github.io/Q-Instruct. Comment: 16 pages, 11 figures, pages 12-16 as appendix
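
    A toy illustration of turning a Q-Pathway-style feedback (detailed description followed by an overall conclusion) into one instruction-response pair. The fixed template and example text below are stand-ins; the paper performs this conversion with a GPT-participated pipeline that produces many question formats.

        def feedback_to_pair(description, conclusion):
            # Assumed template: one generic low-level-vision instruction per feedback.
            instruction = ("Describe the low-level visual appearance of the image "
                           "and rate its overall quality.")
            response = f"{description} Overall, {conclusion}"
            return {"instruction": instruction, "response": response}

        pair = feedback_to_pair(
            description="The image is slightly out of focus and the colors look washed out.",  # made-up example
            conclusion="the quality is below average.",
        )
        print(pair["response"])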